Abstract

Recent work suggests that some auto-encoder variants do a good job of capturing
the local manifold structure of the unknown data-generating density. This paper
contributes to the mathematical understanding of this phenomenon and helps define
better justified sampling algorithms for deep learning based on auto-encoder
variants. We consider an MCMC where each step samples from a Gaussian whose mean
and covariance matrix depend on the previous state, and which defines a target
density through its asymptotic distribution. First, we show that good choices (in
the sense of consistency) for these mean and covariance functions are the local
expected value and local covariance under that target density. Then we show that an
auto-encoder with a contractive penalty captures estimators of these local moments
in its reconstruction function and its Jacobian. A contribution of this work is thus
a novel alternative to maximum-likelihood density estimation, which we call local
moment matching. It also justifies a recently proposed sampling algorithm for the
Contractive Auto-Encoder and extends it to the Denoising Auto-Encoder.

1 Introduction 

Machine learning is about capturing aspects of the unknown distribution from which
the observed data are sampled (the data-generating distribution). For many learning
algorithms and in particular in manifold learning, the focus is on identifying the
regions (sets of points) in the space of examples where this distribution
concentrates, i.e., which configurations of the observed variables are plausible.

Unsupervised representation-learning algorithms attempt to characterize the
data-generating distribution through the discovery of a set of features or latent
variables whose variations capture most of the structure of the data-generating
distribution. In recent years, a number of unsupervised feature learning algorithms
have been proposed that are based on minimizing some form of reconstruction error,
such as auto-encoder and sparse coding variants (Bengio et al., 2007; Ranzato et al.,
2007; Jain and Seung, 2008; Ranzato et al., 2008; Vincent et al., 2008; Kavukcuoglu
et al., 2009; Rifai et al., 2011a,b,c; Gregor et al., 2011). An auto-encoder
reconstructs the input through two stages, an encoder function $f$ (which outputs a
learned representation $h = f(x)$ of an example $x$) and a decoder function $g$, such
that $g(f(x)) \approx x$ for most $x$ sampled from the data-generating distribution.
These feature learning algorithms can be stacked to form deeper and more abstract
representations. There are arguments and much empirical evidence to suggest that when
they are well-trained, such deep learning algorithms (Hinton et al., 2006; Bengio,
2009; Lee et al., 2009; Salakhutdinov and Hinton, 2009) can perform better than their
shallow counterparts, both in terms of learning features for the purpose of
classification tasks and for generating higher-quality samples.

Here we restrict ourselves to the case of continuous inputs $x \in \mathbb{R}^d$ with
the data-generating distribution being associated with an unknown target density
function, denoted $p$. Manifold learning algorithms assume that $p$ is concentrated in
regions of lower dimension (Cayton, 2005; Narayanan and Mitter, 2010), i.e., the
training examples are by definition located very close to these high-density
manifolds. In that context, the core objective of manifold learning algorithms is to
identify the local directions of variation, such that small movement in input space
along these directions stays on or near the high-density manifold.

Some important questions remain concerning many of these feature learning
algorithms. What is their training criterion learning about the input density? Do
these algorithms implicitly learn about the whole density or only some aspect? If
they capture the essence of the target density, then can we formalize that link and
in particular exploit it to sample from the model? This would turn these algorithms
into implicit density models, which only define a density indirectly, e.g., through a
generative procedure that converges to it. These are the questions to which this
paper contributes.

A crucial starting point for this work is very recent work (Rifai et al., 2012)
proposing a sampling algorithm for Contractive Auto-Encoders, detailed in the next
section. This algorithm was motivated on geometrical grounds, based on the
observation and intuition that the leading singular vectors of the Jacobian of the
encoder function specify those main directions of variation (i.e., the tangent plane
of the manifold, the local directions that preserve the high-probability nature of
training examples). Here we make a formal link between the target density and models
minimizing reconstruction error through a contractive mapping, such as the
Contractive Auto-Encoder (Rifai et al., 2011a) and the Denoising Auto-Encoder
(Vincent et al., 2008). This allows us to justify sampling algorithms similar to that
proposed by Rifai et al. (2012), and apply these ideas to Denoising Auto-Encoders as
well.

We define a novel alternative to maximum likelihood training, local moment matching,
and find that Contractive and Denoising Auto-Encoders perform it. This is achieved by
optimizing a criterion (such as a regularized reconstruction error) such that the
optimal learned reconstruction function (and its derivatives) provide estimators of
the local moments (and local derivatives) of the target density. These local moments
can be used to define an implicit density, the asymptotic distribution of a
particular Markov chain, which can also be seen as corresponding to an uncountable
Gaussian mixture, with one Gaussian component at each possible location in input
space.

The main novel contributions of this paper are the following. First, we show in
Section 2 that the Denoising Auto-Encoder with small Gaussian perturbations and
squared error loss is actually a Contractive Auto-Encoder whose contraction penalty
is the magnitude of the perturbation, making the theory developed here applicable to
both, and in particular extending the sampling procedure used for the Contractive
Auto-Encoder to the Denoising Auto-Encoder as well. Second, we present in Section 3.3
consistency arguments justifying the use of the first and second estimated local
moments to sample from a chain. Such a sampling algorithm has successfully been used
in Rifai et al. (2012) to sample from a Contractive Auto-Encoder. With small enough
steps, we show that the asymptotic distribution of the chain has similar (smoothed)
first and second local moments as those estimated. Third, we show in Section 3.4 that
non-parametrically minimizing reconstruction error with a contractive regularizer
yields a reconstruction function whose value and Jacobian matrix estimate
respectively the first and second local moments, i.e., up to a scaling factor, are
the right functions to use in the Markov chain. Finally, although the sampling
algorithm was already empirically verified in Rifai et al. (2012), we include in
Section 4 an experimental validation for the case when the model is trained with the
denoising criterion.

2 Contractive and Denoising Auto-Encoders 

The Contractive Auto-Encoder or CAE (Rifai et al., 2011a) is trained to minimize the
following regularized reconstruction error:

$$\mathcal{L}_{CAE} = \mathbb{E}\left[ \ell(x, r(x)) + \alpha \left\| \frac{\partial f(x)}{\partial x} \right\|_F^2 \right] \tag{1}$$

where $r(x) = g(f(x))$ and $\|A\|_F^2$ is the sum of the squares of the elements of
$A$. Both the squared loss $\ell(x, r) = \frac{1}{2}\|x - r\|^2$ and the cross-entropy
loss $\ell(x, r) = -x \log r - (1 - x)\log(1 - r)$ have been used, but here we focus
our analysis on the squared loss because of the easier mathematical treatment it
allows. Note that success in minimizing the above criterion strongly depends on the
parametrization of $f$ and $g$ and in particular on the tied weights constraint used,
with $f(x) = \mathrm{sigmoid}(Wx + b)$ and $g(h) = \mathrm{sigmoid}(W^T h + c)$. The
above regularizing term forces $f$ (as well as $g$, because of the tied weights) to be
contractive, i.e., to have singular values less than 1.¹ Larger values of $\alpha$
yield more contraction (smaller singular values) where it hurts reconstruction error
the least, i.e., in the local directions where there is little or no variation in the
data.

The Denoising Auto-Encoder or DAE (Vincent et al., 2008) is trained to minimize
the following denoising criterion:

$$\mathcal{L}_{DAE} = \mathbb{E}\left[ \ell(x, r(N(x))) \right] \tag{2}$$

where $N(x)$ is a stochastic corruption of $x$ and the expectation is over the
training distribution. Here we consider mostly the squared loss and Gaussian noise
corruption, again because it is easier to handle them mathematically. In particular,
if $N(x) = x + \epsilon$ with $\epsilon$ a small zero-mean isotropic Gaussian noise
vector of variance $\sigma^2$, then a Taylor expansion around $x$ gives
$r(x + \epsilon) \approx r(x) + \frac{\partial r(x)}{\partial x}\epsilon$, which when
plugged into $\mathcal{L}_{DAE}$ gives

$$\begin{aligned}
\mathcal{L}_{DAE} &\approx \mathbb{E}\left[ \frac{1}{2} \left\| x - \left( r(x) + \frac{\partial r(x)}{\partial x}\epsilon \right) \right\|^2 \right] \\
&= \frac{1}{2}\left( \mathbb{E}\left[\|x - r(x)\|^2\right] - 2\,\mathbb{E}[\epsilon]^T\, \mathbb{E}\left[ \frac{\partial r(x)}{\partial x}^T (x - r(x)) \right] + \mathrm{Tr}\left( \mathbb{E}\left[\epsilon\epsilon^T\right] \mathbb{E}\left[ \frac{\partial r(x)}{\partial x}^T \frac{\partial r(x)}{\partial x} \right] \right) \right) \\
&= \frac{1}{2}\left( \mathbb{E}\left[\|x - r(x)\|^2\right] + \sigma^2\, \mathbb{E}\left[ \left\| \frac{\partial r(x)}{\partial x} \right\|_F^2 \right] \right)
\end{aligned} \tag{3}$$

where in the second line we used the independence of the noise from $x$ and
properties of the trace, while in the last line we used
$\mathbb{E}[\epsilon\epsilon^T] = \sigma^2 I$ and $\mathbb{E}[\epsilon] = 0$ by
definition of $\epsilon$. This derivation shows that the DAE is also a Contractive
Auto-Encoder but where the contraction is imposed explicitly on the whole
reconstruction function $r(\cdot) = g(f(\cdot))$ rather than on $f(\cdot)$ alone (and
$g(\cdot)$ as a side effect of the parametrization).

1 Note that an auto-encoder without any regularization would tend to find many
leading singular values near 1 in order to minimize reconstruction error, i.e.,
preserve input norm in all the directions of variation present in the data.
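The equivalence between the denoising criterion and a contractive penalty on the
whole reconstruction function can be checked numerically. The sketch below is only an
illustration under stated assumptions: it uses a hypothetical linear reconstruction
r(x) = A x + b (for which the first-order Taylor expansion is exact), drops any
constant factor in front of the squared loss since it scales both sides equally, and
compares a Monte Carlo estimate of the denoising loss at one point x with the
reconstruction error plus sigma^2 times the squared Frobenius norm of the Jacobian.

```python
import numpy as np

def dae_loss_monte_carlo(x, A, b, sigma, n_samples, rng):
    """Monte Carlo estimate of E_eps[ ||x - r(x + eps)||^2 ] for r(x) = A x + b."""
    eps = sigma * rng.standard_normal((n_samples, x.shape[0]))
    recon = (x + eps) @ A.T + b          # r(x + eps), one row per noise sample
    return np.mean(np.sum((x - recon) ** 2, axis=1))

def contractive_form(x, A, b, sigma):
    """||x - r(x)||^2 + sigma^2 * ||dr/dx||_F^2, where dr/dx = A for a linear r."""
    residual = x - (A @ x + b)
    return residual @ residual + sigma ** 2 * np.sum(A ** 2)

# Hypothetical toy values, chosen only for illustration.
rng = np.random.default_rng(0)
d = 5
A = 0.3 * rng.standard_normal((d, d))
b = 0.1 * rng.standard_normal(d)
x = rng.standard_normal(d)
sigma = 0.1

mc_value = dae_loss_monte_carlo(x, A, b, sigma, 200_000, rng)
closed_form = contractive_form(x, A, b, sigma)
print(mc_value, closed_form)
```

For a non-linear r the two quantities would agree only up to the higher-order terms
neglected by the Taylor expansion, which is why the text requires the noise to be
small.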

2.1 A CAE Sampling Algorithm 

Consider the following coupled Markov chains with elements $M_t$ and $X_t$
respectively:

$$\begin{aligned}
M_{t+1} &= \mu(X_t) \\
X_{t+1} &= M_{t+1} + Z_{t+1}
\end{aligned} \tag{4}$$

where $M_t, X_t, \mu(X_t) \in \mathbb{R}^d$ and $Z_{t+1}$ is a sample from a zero-mean
Gaussian with covariance $\Sigma(X_t)$.

The basic algorithm for sampling from the CAE, proposed in Rifai et al. (2012), is
based on the above Markov chain operating in the space of hidden representation
$h = f(x)$, with $\mu(h) = f(g(h))$ and
$Z_{t+1} = \left(\frac{\partial f(x_t)}{\partial x_t}\right)\left(\frac{\partial f(x_t)}{\partial x_t}\right)^T \epsilon$,
where $\epsilon$ is a zero-mean isotropic Gaussian noise vector in $h$-space. This
defines a chain of hidden representations $h_t$, and the corresponding chain of
input-space samples is given by $x_t = g(h_t)$. Slightly better results are obtained
with this $h$-space chain than with the corresponding $x$-space chain, which defines
$\mu(x) = g(f(x))$ and
$Z_{t+1} = \left(\frac{\partial r(x)}{\partial x}\right)\left(\frac{\partial r(x)}{\partial x}\right)^T \epsilon$,
where $\epsilon$ is zero-mean isotropic Gaussian noise in $x$-space. We conjecture
that this advantage stems from the fact that moves in $h$-space are done in a more
abstract, more non-linear space.
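As an illustration of the x-space chain described above, the following sketch
iterates x <- r(x) + J J^T eps for a tied-weight sigmoid auto-encoder. It is a
minimal sketch, not the procedure of Rifai et al. (2012): the weights are random and
untrained, and the names `reconstruct_with_jacobian`, `sample_chain` and `noise_std`
are hypothetical choices made here. The Jacobian of r is obtained analytically by the
chain rule.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def reconstruct_with_jacobian(x, W, b, c):
    """Return r(x) = g(f(x)) and dr/dx for a tied-weight sigmoid auto-encoder."""
    h = sigmoid(W @ x + b)                       # encoder f(x)
    r = sigmoid(W.T @ h + c)                     # decoder g(h)
    # Chain rule: dr/dx = diag(r(1-r)) W^T diag(h(1-h)) W
    left = (r * (1.0 - r))[:, None] * (W.T * (h * (1.0 - h))[None, :])
    return r, left @ W

def sample_chain(x0, W, b, c, noise_std, n_steps, rng):
    """x-space chain: X_{t+1} = r(X_t) + J J^T eps, with eps isotropic Gaussian."""
    x = x0.copy()
    samples = []
    for _ in range(n_steps):
        r, jac = reconstruct_with_jacobian(x, W, b, c)
        eps = noise_std * rng.standard_normal(x.shape[0])
        x = r + jac @ (jac.T @ eps)
        samples.append(x)
    return np.array(samples)

# Untrained random parameters, used only so that the sketch runs end to end.
rng = np.random.default_rng(0)
d, n_hidden = 4, 6
W = rng.standard_normal((n_hidden, d))
b = np.zeros(n_hidden)
c = np.zeros(d)
samples = sample_chain(rng.uniform(0.2, 0.8, size=d), W, b, c,
                       noise_std=0.5, n_steps=50, rng=rng)
print(samples.shape)
```

With a trained model, `noise_std` plays the role of the hyper-parameter that sets the
scale of the Markov chain steps.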

3 Local Moment Matching as an Alternative to Maximum Likelihood

3.1 Previous Related Work 

Well-known manifold learning ("embedding") algorithms include Kernel PCA (Schölkopf
et al., 1998), LLE (Roweis and Saul, 2000), Isomap (Tenenbaum et al., 2000),
Laplacian Eigenmap (Belkin and Niyogi, 2003), Hessian Eigenmaps (Donoho and Grimes,
2003), Semidefinite Embedding (Weinberger and Saul, 2004), SNE (Hinton and Roweis,
2003) and t-SNE (van der Maaten and Hinton, 2008) that were primarily developed and
used for data visualization through dimensionality reduction. These algorithms
optimize the hidden representation associated with training points in order to best
preserve certain properties of an input-space neighborhood graph.

The properties that we are interested in here are the local mean and local
covariance. They are defined as the mean and covariance of a density restricted to a
small neighborhood. For example, if we have lots of samples and we only consider the
samples around a point $x_0$, the local mean at $x_0$ would be estimated by the mean
of these neighbors and the local covariance at $x_0$ by the empirical covariance
among these neighbors. There are previous machine learning algorithms that have been
proposed to estimate these local first and second moments by actually using local
neighbors (Brand, 2003; Vincent and Bengio, 2003; Bengio et al., 2006). In Manifold
Parzen Windows (Vincent and Bengio, 2003) this is literally achieved by estimating
for each test point the empirical mean and empirical covariance of the neighboring
points, with a regularization of the empirical covariance that sets a floor value for
the eigenvalues of the covariance matrix. In Non-Local Manifold Parzen (Bengio et
al., 2006), the mean and covariance are predicted by a neural network that takes
$x_0$ as input and outputs the estimated mean along with a basis for the leading
eigenvectors and eigenvalues of the estimated covariance. The predictor is trained to
maximize the likelihood of the near neighbors under the Gaussian with the predicted
mean and covariance parameters. Both algorithms are manifold learning algorithms
motivated by the objective to discover the local manifold structure of the data, and
in particular predict the manifold tangent planes around any given test point.
Besides the computational difficulty of having to find the k nearest neighbors of
each training example, these algorithms, especially Manifold Parzen Windows, heavily
rely on the smoothness of the target manifolds, so that there are enough samples to
teach the model how and where the manifolds bend.
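The neighbor-based estimate of the local mean and local covariance described above
can be sketched in a few lines. This is a minimal illustration, not the Manifold
Parzen Windows algorithm itself (no eigenvalue floor is applied); the function name
`local_moments`, the radius `delta` and the synthetic data (points scattered around a
line in the plane) are assumptions made for the example.

```python
import numpy as np

def local_moments(data, x0, delta):
    """Empirical mean and covariance of the rows of `data` within distance delta of x0."""
    dist = np.linalg.norm(data - x0, axis=1)
    neighbors = data[dist < delta]
    if len(neighbors) < 2:
        raise ValueError("need at least two neighbors inside the delta-ball")
    mean = neighbors.mean(axis=0)
    centered = neighbors - mean
    cov = centered.T @ centered / len(neighbors)
    return mean, cov

# Synthetic 1-D manifold (a line) embedded in 2-D, with small off-manifold noise.
rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=5000)
direction = np.array([1.0, 1.0]) / np.sqrt(2.0)
data = t[:, None] * direction + 0.01 * rng.standard_normal((5000, 2))

m0, C0 = local_moments(data, x0=np.zeros(2), delta=0.2)
eigvals, eigvecs = np.linalg.eigh(C0)   # eigenvalues in ascending order
leading = eigvecs[:, -1]                # estimated tangent direction at x0
print(m0, leading)
```

The leading eigenvector of the local covariance recovers the tangent direction of the
line, which is the sense in which local second moments capture the manifold's local
directions of variation.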

Note that the term local moment matching was already used by Gerber (1982) in an
actuarial context to match moments of a discretized scalar distribution. Here we
consider the more general problem of modeling a multivariate density from data, by
estimating the first and second multivariate moments at every possible input point.

3.2 A Sampling Procedure From Local Moment Estimators 

We first show that mild conditions suffice for the chain to converge (this is also
shown in Rifai et al., 2012) and then that if local first and second moments have
been estimated, then one can define a plausible sampling algorithm based on a Markov
chain that exploits these local moments at each step.

Convergence of the Chain

This Markov chain $X_1, X_2, \ldots, X_t, \ldots$ is precisely the one that samples a
new point by sampling from the Gaussian with these local first and second moments:

$$X_{t+1} \sim \mathcal{N}(\mu(X_t), \Sigma(X_t)) \tag{5}$$

as in Eq. (4), but where we propose to choose $\mu(X_t) - X_t$ proportional to the
local mean at $x_0 = X_t$ minus $X_t$, and $\Sigma(X_t)$ proportional to the local
covariance at $x_0 = X_t$. The functions $\mu(\cdot)$ and $\Sigma(\cdot)$ thus define
a Markov chain.

Let us sketch a proof that it converges under mild hypotheses, using the
decomposition into the chains of $M_t$'s and of $X_t$'s. Assuming that
$\forall x, \mu(x) \in B$ for some bounded ball $B$, then $M_t \in B\ \forall t$. If
we further assume that $\Sigma(X_t)$ is always full rank, then there is a non-zero
probability of jumping from any $M_t \in B$ to any $M_{t+1} \in B$, which is
sufficient for ergodicity of the chain and its convergence. Then if the $M_t$'s
converge, so do their noisy counterparts, the $X_t$'s.
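To make the chain of Eq. (5) concrete, the sketch below runs many independent chains
in parallel for a toy choice of mean and covariance functions. The choice
mu(x) = 0.9 x and Sigma(x) = 0.19 I is an assumption made here because the stationary
distribution of this linear chain is known to be a standard Gaussian (variance
0.19 / (1 - 0.81) = 1); note that a linear mu does not satisfy the bounded-ball
hypothesis of the proof sketch, so this only illustrates convergence, not the
argument itself.

```python
import numpy as np

def run_chains(mu, sigma_chol, x_init, n_steps, rng):
    """Iterate X_{t+1} ~ N(mu(X_t), Sigma(X_t)) for a batch of independent chains."""
    x = x_init.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # sigma_chol(x)[n] is a factor L with L L^T = Sigma(x[n])
        x = mu(x) + np.einsum("nij,nj->ni", sigma_chol(x), noise)
    return x

a, s2 = 0.9, 0.19   # toy values with known stationary variance s2 / (1 - a^2) = 1

def mu(x):
    return a * x

def sigma_chol(x):
    n, d = x.shape
    return np.broadcast_to(np.sqrt(s2) * np.eye(d), (n, d, d))

rng = np.random.default_rng(0)
x_init = np.full((20000, 2), 5.0)        # all chains start far from the mode
x_final = run_chains(mu, sigma_chol, x_init, n_steps=200, rng=rng)
print(x_final.mean(axis=0), x_final.var(axis=0))
```

Passing a state-dependent `sigma_chol` is what distinguishes this chain from a plain
random walk: in the paper's setting both functions would come from estimated local
moments.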



Uncountable Gaussian Mixture

If the chain converges, let $\pi$ be the asymptotic distribution of the $X_t$'s. It
is interesting to note that $\pi$ satisfies the operator equation

$$\pi(x) = \int \pi(\tilde{x})\, \mathcal{N}(x; \mu(\tilde{x}), \Sigma(\tilde{x}))\, d\tilde{x} \tag{6}$$

where $\mathcal{N}(x; \mu(\tilde{x}), \Sigma(\tilde{x}))$ is the density of $x$ under
a multivariate normal distribution with mean $\mu(\tilde{x})$ and covariance
$\Sigma(\tilde{x})$. This can be seen as a kind of uncountable Gaussian mixture $\pi$
where the weight of each component $\tilde{x}$ is given by $\pi(\tilde{x})$ itself
and the functions $\mu(\cdot)$ and $\Sigma(\cdot)$ specify the mean and covariance of
each component.
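The fixed-point property of the asymptotic distribution can be verified on a grid in
one dimension. The sketch below assumes the toy functions mu(x) = 0.9 x and
Sigma(x) = 0.19 (hypothetical values, chosen because the stationary density is then
the standard Gaussian) and checks that applying the Gaussian transition kernel to
that density returns the same density, with the integral replaced by a Riemann sum.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

grid = np.linspace(-8.0, 8.0, 1601)
dx = grid[1] - grid[0]
a, s2 = 0.9, 0.19
pi = gaussian_pdf(grid, 0.0, s2 / (1.0 - a ** 2))    # candidate stationary density

# kernel[i, j] = N(grid[i]; mu(grid[j]), Sigma(grid[j]))
kernel = gaussian_pdf(grid[:, None], a * grid[None, :], s2)
rhs = (kernel @ pi) * dx          # Riemann sum over the mixture components
max_error = np.max(np.abs(rhs - pi))
print(max_error)
```

Each column of `kernel` is one Gaussian component of the mixture, weighted by the
value of the density at the component's location, which is exactly the reading of the
operator equation given in the text.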

3.3 Consistency 

From the point of view of learning, what we would like is that the $\mu(\cdot)$ and
$\Sigma(\cdot)$ functions used in a sampling chain such as Eq. (5) be such that they
yield an asymptotic density $\pi$ close to some target density $p$. Because the
Markov chain makes noisy finite steps, one would expect that the best one can hope
for is that $\pi$ be a smooth approximation of $p$.

What we show below is that, in the asymptotic regime of very small steps (i.e.,
$\Sigma(x)$ is small in magnitude), a good choice is just the intuitive one, where
$\mu(x_0) - x_0 \propto E[x|x_0] - x_0$ and $\Sigma(x_0) \propto \mathrm{Cov}(x|x_0)$.

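For this purpose, we formally define the local density $p_\delta(x|x_0)$ of points
$x$ in the $\delta$-neighborhood of an $x_0$, as

$$p_\delta(x|x_0) = \frac{p(x)\, 1_{\|x - x_0\| < \delta}}{Z(x_0)} \tag{7}$$

where $Z(x_0)$ is the appropriate normalizing constant. Then we respectively define
the local mean and covariance around $x_0$ as simply being

$$\begin{aligned}
m_0 &\overset{\text{def}}{=} E[x|x_0] = \int x\, p_\delta(x|x_0)\, dx \\
C_0 &\overset{\text{def}}{=} \mathrm{Cov}(x|x_0) = \int (x - m_0)(x - m_0)^T p_\delta(x|x_0)\, dx.
\end{aligned} \tag{8}$$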

Note that we have two scales here, the scale $\delta$ at which we take the local mean
and covariance, and the scale $\sigma = \|\Sigma(x_0)\|$ of the Markov chain steps. To
prove consistency we assume here that $\sigma \ll \delta$ and that both are small.
Furthermore, we assume that $\mu$ and $\Sigma$ are somehow calibrated, so that the
steps $\mu(x) - x$ are comparable in size to $\sigma$. This means that
$\|\mu(x) - x\| \ll \delta$, which we use below.

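We want to compute the local mean obtained when we follow the Markov chain, i.e., in
Eq. (7) we choose a $p$ equal to $\pi$ (which is defined in Eq. (6)), and we obtain
the following:

$$\begin{aligned}
m_\pi \overset{\text{def}}{=} E_\pi[x|x_0] &= \frac{1}{Z(x_0)} \int_x x \left( \int_{\tilde{x}} p(\tilde{x})\, \mathcal{N}(x; \mu(\tilde{x}), \Sigma(\tilde{x}))\, d\tilde{x} \right) 1_{\|x - x_0\| < \delta}\, dx \\
&= \frac{1}{Z(x_0)} \int_{\tilde{x}} p(\tilde{x}) \int_{\|x - x_0\| < \delta} x\, \mathcal{N}(x; \mu(\tilde{x}), \Sigma(\tilde{x}))\, dx\, d\tilde{x}.
\end{aligned} \tag{9}$$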

Because the Gaussian sample of the inner integral must be in a small region inside
the $\delta$-ball, the inner integral is approximately the Gaussian mean if
$\mu(\tilde{x})$ is in the $\delta$-ball, and 0 otherwise:

$$\int_{\|x - x_0\| < \delta} x\, \mathcal{N}(x; \mu(\tilde{x}), \Sigma(\tilde{x}))\, dx \approx \mu(\tilde{x})\, 1_{\|\mu(\tilde{x}) - x_0\| < \delta} \tag{10}$$

which gives

$$m_\pi \approx \frac{1}{Z(x_0)} \int_{\tilde{x}} p(\tilde{x})\, \mu(\tilde{x})\, 1_{\|\mu(\tilde{x}) - x_0\| < \delta}\, d\tilde{x}. \tag{11}$$

Now we use the above assumptions, which give $\|\mu(\tilde{x}) - \tilde{x}\| \ll \delta$,
to conclude that integrating under the region defined by
$1_{\|\mu(\tilde{x}) - x_0\| < \delta}$ is equivalent to integrating under the region
defined by $1_{\|\tilde{x} - x_0\| < \delta}$. Hence the above approximation is
rewritten

$$m_\pi \approx \int_{\tilde{x}} \frac{p(\tilde{x})\, 1_{\|\tilde{x} - x_0\| < \delta}}{Z(x_0)}\, \mu(\tilde{x})\, d\tilde{x} \tag{12}$$

which by the definition of $E[\cdot|x_0]$ gives the final result:

$$m_\pi \approx E[\mu(x)|x_0]. \tag{13}$$

It means that the local mean under the small-steps Markov chain is a local mean of
the chain's $\mu$'s. This justifies choosing $\mu(x)$ equal to the local mean of the
target density to be represented by the chain, so that the Markov chain will yield an
asymptotic distribution that has local moments that are close but smooth versions of
those of the target density.

A similar result can be shown for the covariance by observing that the $x$ term in
Eq. (10) produced the first moment of the Gaussian and that the same reasoning would
apply with $xx^T$ instead.
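The approximation in Eq. (10), on which the consistency argument rests, can be
checked numerically in one dimension. The sketch below is an illustration under
assumed values (delta = 1, a Gaussian standard deviation of 0.01, and two hypothetical
means, one inside and one outside the delta-ball); it computes the truncated first
moment with a Riemann sum and compares it with the Gaussian mean times the indicator.

```python
import numpy as np

def truncated_first_moment(mu, sigma, x0, delta, n_grid=200001):
    """Integral of x * N(x; mu, sigma^2) over |x - x0| < delta (1-D Riemann sum)."""
    x = np.linspace(x0 - delta, x0 + delta, n_grid)
    dx = x[1] - x[0]
    density = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return np.sum(x * density) * dx

x0, delta, sigma = 0.0, 1.0, 0.01       # sigma << delta, as assumed in the text
inside = truncated_first_moment(mu=0.3, sigma=sigma, x0=x0, delta=delta)
outside = truncated_first_moment(mu=1.5, sigma=sigma, x0=x0, delta=delta)
print(inside, outside)
```

The approximation degrades when the mean lies within a few standard deviations of the
boundary of the ball, which is one reason the argument needs the step scale to be
much smaller than the neighborhood scale.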

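Alternatively, we can follow a shortened version of the above starting with

$$\begin{aligned}
C_\pi &\overset{\text{def}}{=} E_\pi\left[(x - m_\pi)(x - m_\pi)^T \,|\, x_0\right] && (14) \\
&= \frac{1}{Z(x_0)} \int_{\tilde{x}} p(\tilde{x}) \int_{\|x - x_0\| < \delta} (x - m_\pi)(x - m_\pi)^T \mathcal{N}(x; \mu(\tilde{x}), \Sigma(\tilde{x}))\, dx\, d\tilde{x} && (15)
\end{aligned}$$

where

$$\begin{aligned}
&\int_{\|x - x_0\| < \delta} (x - m_\pi)(x - m_\pi)^T \mathcal{N}(x; \mu(\tilde{x}), \Sigma(\tilde{x}))\, dx && (16) \\
&\quad \approx \left( \Sigma(\tilde{x}) + (\mu(\tilde{x}) - m_\pi)(\mu(\tilde{x}) - m_\pi)^T \right) 1_{\|\mu(\tilde{x}) - x_0\| < \delta}. && (17)
\end{aligned}$$

By construction the magnitude of the covariance $\Sigma(\tilde{x})$ was made very
small, so the following term vanishes

$$\int_{\tilde{x}} \frac{p(\tilde{x})}{Z(x_0)}\, 1_{\|\tilde{x} - x_0\| < \delta}\, \Sigma(\tilde{x})\, d\tilde{x} \to 0 \tag{18}$$

and we are left with the desired result

$$C_\pi \approx E\left[(\mu(x) - m_\pi)(\mu(x) - m_\pi)^T \,|\, x_0\right]. \tag{19}$$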



3.4 Local Moment Matching By Minimizing Regularized Reconstruction Error

We consider here an alternative to using nearest neighbors for estimating local
moments, by showing that minimizing reconstruction error with a contractive penalty
yields estimators of the local mean and covariance.

We start from a training criterion similar to the CAE's but penalizing the
contraction of the whole auto-encoder's reconstruction function, which is also
equivalent to the DAE's training criterion in the case of small Gaussian corruption
noise (as shown in Eq. (3)):

$$\mathcal{L}_{global} = \int p(x_0) \left( \|x_0 - r(x_0)\|^2 + \alpha \left\| \frac{\partial r(x_0)}{\partial x_0} \right\|_F^2 \right) dx_0 \tag{20}$$

where $p$ is the target or training distribution. We prove that in a non-parametric
setting (where $r$ is completely free), the optimal $r$ is such that the local mean
$m_0$ is estimated by $r_0 \overset{\text{def}}{=} r(x_0)$ while the local covariance
$C_0$ is estimated³ by
$J_0 \overset{\text{def}}{=} \left.\frac{\partial r}{\partial x}\right|_{x_0}$.

To find out what the auto-encoder estimates we follow an approach which has already
been used, e.g., to show that minimizing squared prediction error $E[(f(X) - Y)^2]$
is equivalent to estimating the conditional expectation, $f(X) \to E[Y|X]$. For this
purpose we consider an asymptotic and non-parametric setting corresponding to the
limit where the number of examples goes to infinity (we actually minimize the
expected error) and the capacity goes to infinity: we allow the value $r_0$ and the
derivative $J_0$ of $r$ at every point $x_0$ to be different, i.e., we "parametrize"
$r(x)$ in every neighborhood around $x_0$ by

$$r(x) = r(x_0) + \left.\frac{\partial r}{\partial x}\right|_{x_0} (x - x_0) = r_0 + J_0 (x - x_0) \tag{21}$$

which is like a Taylor expansion only valid in the neighborhood of $x$'s around
$x_0$, but where we actually consider $r_0$ and $J_0$ to be parameters free to be
chosen separately for each $x_0$.

3 In practice, i.e., the parametric case, there is no guarantee that the estimator be
symmetric, but this is easily fixed by symmetrizing it, i.e., using
$\frac{J_0 + J_0^T}{2}$.

Armed with this non-parametric formulation, we consider an infinity of such
neighborhoods and the local density $p_\delta(x|x_0)$ (Eq. (7)) which approaches a
Dirac delta function in the limit of $\delta \to 0$, and we rewrite
$\mathcal{L}_{global}$ as follows:

$$\mathcal{L}_{global} = \lim_{\delta \to 0} \int p(x_0) \left[ \left( \int \|x - r(x)\|^2\, p_\delta(x|x_0)\, dx \right) + \alpha \left\| \frac{\partial r(x_0)}{\partial x_0} \right\|_F^2 \right] dx_0. \tag{22}$$

The reason for choosing $p_\delta(x|x_0)$ that turns into a Dirac is that the
expectations of $x$ and $xx^T$ arising in the above inner integral will give rise to
the local mean $m_0$ and local covariance $C_0$. If in the above equation we define
$r(x)$ non-parametrically (as per Eq. (21)), the minimum can be achieved by
considering the separate minimization in each $x_0$ neighborhood with respect to
$r_0$ and $J_0$. We can express the local contribution to the loss at $x_0$ as

$$\mathcal{L}_{local}(x_0, \delta) = \int_x \|x - (r_0 + J_0 (x - x_0))\|^2\, p_\delta(x|x_0)\, dx + \alpha \|J_0\|_F^2 \tag{23}$$

so that

$$\mathcal{L}_{global} = \lim_{\delta \to 0} \int p(x_0)\, \mathcal{L}_{local}(x_0, \delta)\, dx_0. \tag{24}$$

We take the gradient of the local loss with respect to $r_0$ and $J_0$ and set it to
0 (detailed derivation in Appendix). Writing $R_0 = E[xx^T|x_0]$, the gradients are

$$\begin{aligned}
\frac{\partial \mathcal{L}_{local}(x_0, \delta)}{\partial r_0} &= 2(r_0 - m_0) + 2 J_0 (m_0 - x_0) \\
\frac{\partial \mathcal{L}_{local}(x_0, \delta)}{\partial J_0} &= 2\alpha J_0 - 2\left( R_0 - m_0 x_0^T - r_0 (m_0 - x_0)^T \right) \\
&\quad + 2 J_0 \left( R_0 - m_0 x_0^T - x_0 m_0^T + x_0 x_0^T \right).
\end{aligned} \tag{25}$$

Solving these equations (detailed derivation in Appendix) gives us the solutions

$$\begin{aligned}
r_0 &= (I - J_0)\, m_0 + J_0 x_0 \\
J_0 &= C_0 (\alpha I + C_0)^{-1}.
\end{aligned} \tag{26}$$

Note that these quantities $m_0$, $C_0$ are defined through $p_\delta(x|x_0)$ so they
depend implicitly on $\delta$ and we should consider what happens when we take the
limit $\delta \to 0$.

In particular, when $\delta \to 0$ we have that $\|C_0\| \to 0$ and we can see from
the solutions (26) that this forces $\|J_0\| \to 0$. In a practical numerical
application, we fix $\delta > 0$ to be small and it becomes interesting to see how
these quantities relate asymptotically (in terms of $\delta$ decreasing). In such a
situation, we have the following asymptotically:⁴

$$\begin{aligned}
r_0 &\asymp m_0 \\
J_0 &\asymp \alpha^{-1} C_0.
\end{aligned} \tag{27}$$

Thus, we have proved that the value of the reconstruction and its Jacobian
immediately give us estimators of the local mean and local covariance, respectively.
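The closed-form minimizer of the local criterion can be sanity-checked numerically.
The sketch below is an illustration with randomly drawn, hypothetical local moments
(a small covariance, as in the small-delta regime): it builds r_0 and J_0 from the
closed-form solution of Eq. (26), evaluates the gradient expressions of Eq. (25) at
that point, and compares J_0 with the scaled covariance.

```python
import numpy as np

def local_minimizer(m0, C0, x0, alpha):
    """Closed-form minimizer: J0 = C0 (alpha I + C0)^{-1}, r0 = (I - J0) m0 + J0 x0."""
    d = len(x0)
    J0 = C0 @ np.linalg.inv(alpha * np.eye(d) + C0)
    r0 = (np.eye(d) - J0) @ m0 + J0 @ x0
    return r0, J0

def local_gradients(r0, J0, m0, R0, x0, alpha):
    """Gradients of the local loss with respect to r0 and J0."""
    grad_r = 2 * (r0 - m0) + 2 * J0 @ (m0 - x0)
    grad_J = (2 * alpha * J0
              - 2 * (R0 - np.outer(m0, x0) - np.outer(r0, m0 - x0))
              + 2 * J0 @ (R0 - np.outer(m0, x0) - np.outer(x0, m0) + np.outer(x0, x0)))
    return grad_r, grad_J

# Hypothetical local moments: mean close to x0 and a small covariance.
rng = np.random.default_rng(0)
d, alpha = 3, 0.5
x0 = rng.standard_normal(d)
m0 = x0 + 1e-3 * rng.standard_normal(d)
B = 1e-2 * rng.standard_normal((d, d))
C0 = B @ B.T
R0 = C0 + np.outer(m0, m0)              # second moment E[x x^T | x0]

r0, J0 = local_minimizer(m0, C0, x0, alpha)
grad_r, grad_J = local_gradients(r0, J0, m0, R0, x0, alpha)
rel_gap = np.linalg.norm(J0 - C0 / alpha) / np.linalg.norm(C0 / alpha)
print(np.linalg.norm(grad_r), np.linalg.norm(grad_J), rel_gap)
```

Both gradients vanish up to floating-point error, and the relative gap between J_0
and C_0 / alpha shrinks with the norm of C_0, which is the asymptotic statement of
the text.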

4 Experimental Validation 

In the above analysis we have considered the limit of a non-parametric case with an
infinite amount of data. In that limit, unsurprisingly, reconstruction is perfect,
i.e., $r(x) \to x$, $E[x|x_0] \to x_0$, and
$\left\| \frac{\partial r(x_0)}{\partial x_0} \right\|_F \to 0$ and
$\|\mathrm{Cov}(x|x_0)\|_F \to 0$, at a speed that depends on the scale $\delta$, as
we have seen above. We do not care so much about the magnitudes but instead care
about the directions indicated by $r(x) - x$ (which indicate where to look for
increases in probability density) and by the singular vectors and relative singular
values of $\frac{\partial r(x_0)}{\partial x_0}$, which indicate what directions
preserve high density. In a practical sampling algorithm such as described in
Section 2.1 one wants to take non-infinitesimal steps. Furthermore, in the practical
experiments of Rifai et al. (2012), in order to get good generalization with a
limited training set, one typically works with a parametrized model which cannot
perfectly reconstruct the training set (but can generalize). This means that the
learned reconstruction is not equal to the input, even on training examples, nor is
the Jacobian of the reconstruction function tiny (as it would be in the asymptotic
non-parametric case). The mathematical link between the two situations needs to be
clarified in future work, but a heuristic approach which we found to work well is the
following: control the scale of the Markov chain with a hyper-parameter that sets the
magnitude of the Gaussian noise (the variance of $\epsilon$ in Section 2.1). That
hyper-parameter can be optimized by visual inspection or by estimating the
log-likelihood of the samples, using a technique introduced in Breuleux et al. (2011)
and also used in Rifai et al. (2012). The basic idea is to generate many samples from
the model, train a non-parametric density estimator (Parzen Windows) using these
samples as a training set, and evaluate the log-likelihood of the test set using that
density estimator. If the sample generator does not mix well, then some test examples
will be badly covered (far from any of the generated samples), thus incurring a high
price in log-likelihood. If the generator mixes well but smooths the true density too
much, then the automatically selected bandwidth of the Parzen Windows will be chosen
larger, incurring again a penalty in test log-likelihood.
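The Parzen-window evaluation described above can be sketched as follows. This is a
minimal illustration, not the exact protocol of Breuleux et al. (2011): it assumes an
isotropic Gaussian kernel with a fixed, hand-picked `bandwidth` (in practice the
bandwidth would be selected automatically, e.g. on held-out data), and it uses
synthetic Gaussian points as stand-ins for generated samples and test examples.

```python
import numpy as np

def parzen_log_likelihood(test_points, samples, bandwidth):
    """Mean log-density of test_points under an isotropic Gaussian Parzen estimator."""
    n, d = samples.shape
    sq_dist = np.sum((test_points[:, None, :] - samples[None, :, :]) ** 2, axis=2)
    log_kernel = -0.5 * sq_dist / bandwidth ** 2
    # log-sum-exp over the samples, stabilized by subtracting each row's maximum
    row_max = log_kernel.max(axis=1, keepdims=True)
    log_sum = row_max[:, 0] + np.log(np.sum(np.exp(log_kernel - row_max), axis=1))
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * bandwidth ** 2)
    return np.mean(log_sum - log_norm)

rng = np.random.default_rng(0)
generated = rng.standard_normal((2000, 2))      # stand-in for samples from the model
test_covered = rng.standard_normal((500, 2))    # test points the samples cover well
test_missed = test_covered + 6.0                # test points far from every sample

ll_covered = parzen_log_likelihood(test_covered, generated, bandwidth=0.3)
ll_missed = parzen_log_likelihood(test_missed, generated, bandwidth=0.3)
print(ll_covered, ll_missed)
```

The badly covered test set receives a much lower log-likelihood, which is the penalty
the text attributes to a generator that does not mix well.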

In Fig. 1 we show samples of DAEs trained and sampled similarly as in Rifai et al.
(2012) on both MNIST digit images and the Toronto Face Dataset (TFD) (Susskind et
al., 2010). These results and those already obtained in Rifai et al. (2012) confirm
that the auto-encoder trained either as a CAE or a DAE estimates local moments that
can be followed in a Markov chain to generate likely-looking samples (and which have
been shown quantitatively to be of high quality in Rifai et al. (2012)).

4 Here, to avoid confusion with the overloaded $\sim$ notation for sampling, we
instead use the $\asymp$ notation to denote that the ratio of any coefficient on the
left with its corresponding coefficient on the right goes to 1 as $\delta \to 0$.

Figure 1: Samples generated by a DAE trained on TFD (top 2 rows) and MNIST
(bottom 2 rows).



5 Conclusion 

This paper has contributed a novel approach to modeling densities: indirectly, through 
the estimation of local moments. It has shown that local moments can be estimated 
by auto-encoders with contractive regularization. It has justified a sampling algorithm 
based on a simple Markov chain when estimators of the local moments are available. 
Whereas auto-encoders are unsupervised learning algorithms that have been known for 
many decades, it has never been clear if they captured everything that can be captured 
from a distribution. For the first time, this paper presents a theoretical justification 
showing that they do implicitly perform density estimation (provided some appropri- 
ate regularization is used, and assuming the training criterion can be minimized). This 
provides a more solid footing to the recently proposed algorithm for sampling
Contractive Auto-Encoders (Rifai et al., 2012) and opens the door to other related learning
and sampling algorithms. In particular, it shows that this sampling algorithm can be 
applied to Denoising Auto-Encoders as well. 

An interesting advantage of modeling data through such training criteria is that
there is no need to estimate an intractable partition function or its gradient, and
that there is no difficult inference problem associated with these types of models
either. Future work following up on this paper should try to answer the more
difficult mathematical questions of what happens (e.g., with the consistency
arguments presented here) if the Markov chain steps are not tiny, and when we
consider a learner that has parametric constraints, rather than the asymptotic
non-parametric limit considered here. We believe that the approach presented here can
also be applied to energy-based models (for which the free energy can be computed),
and that the local moments are directly related to the first and second derivative of
the estimated density. Future work should clarify that relationship, possibly giving
rise to new sampling algorithms for energy-based models (since we have shown here
that one can sample from the estimated density if one can compute the local moments).
Finally, it would be interesting to extend this work in the direction of analysis
that explicitly takes into account the decomposition of the auto-encoder into an
encoder and a decoder. Indeed, we have found experimentally that sampling in the
representation space gives better results than sampling in the original input space,
but a more solid mathematical treatment of such algorithms is still missing.
A Derivation of the local training criterion gradient

We want to obtain the derivative of the local training criterion,

$$\mathcal{L}_{local} = \lim_{\delta \to 0} \int \|x - (r_0 + J_0 (x - x_0))\|^2\, p_\delta(x|x_0)\, dx + \alpha \|J_0\|_F^2. \tag{28}$$

We use the definitions

$$\begin{aligned}
\mu_0 &= E[x|x_0] \\
R_0 &= E[xx^T|x_0] \\
C_0 &= \mathrm{Cov}(x|x_0) = E[(x - \mu_0)(x - \mu_0)^T|x_0] = R_0 - \mu_0 \mu_0^T
\end{aligned}$$

where $\mu_0$ is the local mean written $m_0$ in the main text.

We first expand the expected square error:

$$\begin{aligned}
MSE &= E\left[\|x - (r_0 + J_0 (x - x_0))\|^2 \,|\, x_0\right] \\
&= E\left[(x - (r_0 + J_0 (x - x_0)))^T (x - (r_0 + J_0 (x - x_0)))\right] \\
&= E\left[(x - r_0)^T (x - r_0)\right] - 2 E\left[(x - r_0)^T J_0 (x - x_0)\right] + E\left[(x - x_0)^T J_0^T J_0 (x - x_0)\right].
\end{aligned}$$

Differentiating this with respect to $r_0$ yields

$$\frac{\partial MSE}{\partial r_0} = -2(\mu_0 - r_0) + 2 J_0 (\mu_0 - x_0)$$

corresponding to Eq. (25) of the paper.

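For differentiating with respect to $J_0$, we use the trace properties

$$\begin{aligned}
\|A\|_F^2 &= \mathrm{Tr}(AA^T) \\
\mathrm{Tr}(ABC) &= \mathrm{Tr}(BCA) \\
\frac{\partial\, \mathrm{Tr}(A^T X A Z)}{\partial A} &= XAZ + X^T A Z^T \\
\frac{\partial\, \mathrm{Tr}(XA)}{\partial A} &= X^T \\
\frac{\partial\, \mathrm{Tr}(XA^T)}{\partial A} &= X.
\end{aligned}$$

We obtain for the regularizer,

$$\frac{\partial\, \alpha \|J_0\|_F^2}{\partial J_0} = 2\alpha J_0$$

and for the MSE:

$$\begin{aligned}
\frac{\partial MSE}{\partial J_0} &= -2 E\left[ \frac{\partial}{\partial J_0} \mathrm{Tr}\left( J_0 (x - x_0)(x - r_0)^T \right) \right] + E\left[ \frac{\partial}{\partial J_0} \mathrm{Tr}\left( J_0^T J_0 (x - x_0)(x - x_0)^T \right) \right] \\
&= -2 E\left[ (x - r_0)(x - x_0)^T \right] + 2 J_0 E\left[ (x - x_0)(x - x_0)^T \right] \\
&= -2 \left( R_0 - \mu_0 x_0^T - r_0 (\mu_0 - x_0)^T \right) + 2 J_0 \left( R_0 - \mu_0 x_0^T - x_0 \mu_0^T + x_0 x_0^T \right).
\end{aligned}$$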

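B Detailed derivation of the minimizers of the local training criterion

Starting with $r_0$, we solve for the point where the gradient is 0 to obtain

$$\begin{aligned}
\frac{\partial MSE}{\partial r_0} = 0 \iff (\mu_0 - r_0) - J_0 (\mu_0 - x_0) &= 0 \\
\mu_0 - r_0 &= J_0 (\mu_0 - x_0) \\
r_0 &= \mu_0 - J_0 (\mu_0 - x_0) \\
r_0 &= (I - J_0)\, \mu_0 + J_0 x_0.
\end{aligned}$$

Substituting that value for $r_0$ into the expression for the gradient with respect
to $J_0$, we get

$$\begin{aligned}
\frac{\partial MSE}{\partial J_0} &= -2 \left( R_0 - \mu_0 x_0^T - (\mu_0 - J_0 (\mu_0 - x_0))(\mu_0 - x_0)^T \right) + 2 J_0 \left( R_0 - \mu_0 x_0^T - x_0 \mu_0^T + x_0 x_0^T \right) \\
&= -2 \left( R_0 - \mu_0 \mu_0^T \right) - 2 J_0 (\mu_0 - x_0)(\mu_0 - x_0)^T + 2 J_0 \left( R_0 - \mu_0 x_0^T - x_0 \mu_0^T + x_0 x_0^T \right) \\
&= -2 \left( R_0 - \mu_0 \mu_0^T \right) + 2 J_0 \left( R_0 - \mu_0 \mu_0^T \right) \\
&= -2 (I - J_0)\, C_0.
\end{aligned}$$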

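Adding the regularizer term and setting the gradient to 0, we get

$$\begin{aligned}
\frac{\partial \left( MSE + \alpha \|J_0\|_F^2 \right)}{\partial J_0} = -2 (I - J_0)\, C_0 + 2\alpha J_0 &= 0 \\
C_0 &= J_0 C_0 + \alpha J_0 \\
C_0 &= J_0 (C_0 + \alpha I) \\
J_0 &= C_0 (\alpha I + C_0)^{-1}
\end{aligned}$$

which altogether gives us Eq. (26) from the main text:

$$\begin{aligned}
r_0 &= (I - J_0)\, \mu_0 + J_0 x_0 \\
J_0 &= C_0 (\alpha I + C_0)^{-1} = (\alpha I + C_0)^{-1} C_0
\end{aligned}$$

where the two forms of $J_0$ agree because $C_0$ commutes with
$(\alpha I + C_0)^{-1}$.

Note that we can also solve for $\mu_0$

$$\mu_0 = (I - J_0)^{-1} (r_0 - J_0 x_0)$$

and for $C_0$:

$$\begin{aligned}
(I - J_0)\, C_0 &= \alpha J_0 \\
C_0 &= \alpha (I - J_0)^{-1} J_0.
\end{aligned}$$

However, we still have to take the limit as $\delta \to 0$. Noting that the magnitude
of $C_0$ goes to 0 as $\delta \to 0$, it means that $J_0$ also goes to 0 in
magnitude. Plugging in the above equations gives the final results, consistent with
Eq. (27):

$$\begin{aligned}
r_0 &\approx \mu_0 \\
C_0 &\approx \alpha J_0.
\end{aligned}$$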